{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LAB 03.01 - Model Generation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!wget --no-cache -O init.py -q https://raw.githubusercontent.com/fagonzalezo/ai4eng-unal/main/content/init.py\n", "import init; init.init(force_download=False); init.get_weblink()\n", "init.endpoint" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from local.lib.rlxmoocapi import submit, session\n", "session.LoginSequence(endpoint=init.endpoint, course_id=init.course_id, lab_id=\"L03.01\", varname=\"student\");" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_moons\n", "from local.lib import mlutils\n", "from IPython.display import Image\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A machine learning task\n", "\n", "We have two species of bugs (**X bugs** and **Z bugs**). For each bug we have measured its **width** and **length**. Once we have a bug, determining whether it is of **species X** or **species Z** is very costly (lab analysis, etc.).\n", "\n", "**Machine learning goal**: We want to create a model that, given the width and length of a bug, tells us whether it belongs to **species X** or **species Z**. If the model performs well, we might use it instead of the lab analysis.\n", "\n", "**To train a machine learning model** we built a **training dataset** where we have **annotated** 20 bugs with their **confirmed** species. 
The training dataset has:\n", "\n", "- 20 data items\n", "- two data columns (**width** and **length**)\n", "- one label column, with two unique values: **0 for species X**, and **1 for species Z**.\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(20, 3) (20, 2) (20,)\n", "[[0.5 0.65]\n", " [0.75 0.34]\n", " [0.37 0.5 ]\n", " [0.57 0.74]\n", " [1. 0.69]]\n", "[0. 1. 1. 0. 1.]\n" ] }, { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr><th></th><th>width</th><th>height</th><th>y</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>0.50</td><td>0.65</td><td>0.0</td></tr>\n",
 "    <tr><th>1</th><td>0.75</td><td>0.34</td><td>1.0</td></tr>\n",
 "    <tr><th>2</th><td>0.37</td><td>0.50</td><td>1.0</td></tr>\n",
 "    <tr><th>3</th><td>0.57</td><td>0.74</td><td>0.0</td></tr>\n",
 "    <tr><th>4</th><td>1.00</td><td>0.69</td><td>1.0</td></tr>\n",
 "  </tbody>\n",
 "</table>" ], "text/plain": [ "   width  height    y\n", "0   0.50    0.65  0.0\n", "1   0.75    0.34  1.0\n", "2   0.37    0.50  1.0\n", "3   0.57    0.74  0.0\n", "4   1.00    0.69  1.0" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "d = pd.read_csv(\"local/data/trilotropicos_small.csv\")\n", "X,y = d.values[:,:2], d.values[:,-1]\n", "print (d.shape, X.shape, y.shape)\n", "print (X[:5])\n", "print (y[:5])\n", "d.head()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are just two data columns, we can visualize the dataset." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "\n", "plt.scatter(X[y==0][:,0], X[y==0][:,1], color=\"blue\", label=\"X bug\")\n", "plt.scatter(X[y==1][:,0], X[y==1][:,1], color=\"red\", label=\"Z bug\")\n", "plt.xlabel(\"width\");plt.ylabel(\"length\"); plt.legend(); plt.grid();\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1. Manually use a predictive model\n", "\n", "We give you a roughly calibrated procedure that, given a new bug, produces a prediction. The procedure depends on two parameters, $\\theta_0$ and $\\theta_1$. Given the width $w^{(i)}$ and height $h^{(i)}$ of bug number $i$, the prediction $\\hat{y}^{(i)} \\in \\{0, 1\\}$ is computed as follows:\n", "\n", "$$\\hat{y}^{(i)} = 0\\text{ if }w^{(i)}<\\theta_0\\text{ AND }h^{(i)}>\\theta_1;\\;\\;\\;\\;\\;\\text{otherwise }\\hat{y}^{(i)}=1$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be considered a **model template** that depends on two parameters.\n", "\n", "\n", "Complete **the following function** so that, given a `numpy` array `X` $\\in \\mathbb{R}^{m \\times 2}$ containing the width and height of $m$ bugs, it returns a vector in $\\{0, 1\\}^m$ with the predictions for the $m$ bugs, as described in the expression above. The parameter `t` $\\in \\mathbb{R}^2$ contains, in this order, $\\theta_0$ and $\\theta_1$.\n", "\n", "Observe that your function must return a `numpy` vector of **integers** (not booleans). \n", "\n", "**CHALLENGE**: solve it with a single line of code.\n", "\n", "**HINT**: use `.astype(int)` to convert a `numpy` array of booleans to integers." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "def predict(X, t):\n", "    return ..."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manually check your code. With the following `t`, your predictions must be \n", "\n", "    [1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]\n", "\n", "with an accuracy of 0.75." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "t = np.r_[.5,.3]\n", "y_hat = predict(X, t)\n", "y_hat" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "np.mean(y==y_hat)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the classification boundary that the model generates." ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "mlutils.plot_2Ddata_with_boundary(lambda X: predict(X,t), X, y); plt.grid();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now try another `t`. Which one is better?" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "t = np.r_[.5,.8]\n", "mlutils.plot_2Ddata_with_boundary(lambda X: predict(X,t), X, y); plt.grid();\n", "np.mean(y==predict(X,t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the prediction boundaries of other models. Change the `max_depth` of the decision tree to 2. Does it look familiar?" 
] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "mlutils.plot_2Ddata_with_boundary(LogisticRegression().fit(X,y).predict, X, y); plt.grid();" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "mlutils.plot_2Ddata_with_boundary(DecisionTreeClassifier(max_depth=5).fit(X,y).predict, X, y); plt.grid();" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVC\n", "mlutils.plot_2Ddata_with_boundary(SVC(gamma=50).fit(X,y).predict, X, y); plt.grid();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Submit your answer**" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "student.submit_task(globals(), task_id=\"task_01\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2. Fit the model\n", "\n", "Given a set of annotated data $X$, $y$ and the **model template** of the previous exercise, complete the following function so that it returns the $\\theta_0$ and $\\theta_1$ that produce the **best accuracy** on the given `X` and `y`. Consider only values of $\\theta_0$ and $\\theta_1$ **between 0 and 1 with one decimal place**.\n", "\n", "**Hint**: use a brute-force approach and consider all combinations of $\\theta_0$ and $\\theta_1 \\in$ [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]. 
Use [`np.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) and [`itertools.product`](https://docs.python.org/3/library/itertools.html#itertools.product).\n", "\n", "Your function must return a `numpy` array with two elements, the resulting $\\theta_0$ and $\\theta_1$." ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "\n", "\n", "def fit(X,y):\n", "    def predict(X, t):\n", "        return ...\n", "\n", "    return ..." ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [], "source": [ "t = fit(X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check your solution with the code below. The `t` returned by your function should produce an accuracy of 0.9 with the example data `X`, `y`. Several values of `t` might produce the same accuracy; you just have to return any one of them. " ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [], "source": [ "mlutils.plot_2Ddata_with_boundary(lambda X: predict(X,t), X, y); plt.grid();\n", "np.mean(y==predict(X,t))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use your model on different data. Execute the next cells several times to see the effect on different datasets." 
] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "bX, by = make_blobs(100,n_features=2, centers=2)\n", "bX = MinMaxScaler(feature_range=(0.1,.9)).fit_transform(bX)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": {}, "outputs": [], "source": [ "bt = fit(bX, by)" ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [], "source": [ "mlutils.plot_2Ddata_with_boundary(lambda X: predict(X,bt), bX, by); plt.grid();\n", "np.mean(by==predict(bX,bt))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Submit your answer**" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [], "source": [ "student.submit_task(globals(), task_id=\"task_02\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3. Make an `sklearn`-compatible class with your model\n", "\n", "Organize the previous methods in the following class structure. 
Bear in mind that:\n", "\n", "- the `fit` method no longer returns `t`; it stores it in the instance variable `self.t`\n", "- the `fit` method must now return `self`\n", "- the `predict` method no longer accepts `t` as an argument; it must use the one stored in `self.t`" ] }, { "cell_type": "code", "execution_count": 294, "metadata": {}, "outputs": [], "source": [ "def SimpleModel():\n", "    class _SimpleModel:\n", "\n", "        def __init__(self):\n", "            pass\n", "\n", "        def fit(self, X, y):\n", "\n", "            ...\n", "            return self\n", "\n", "        def predict(self, X):\n", "            return ...\n", "\n", "    return _SimpleModel()" ] }, { "cell_type": "code", "execution_count": 295, "metadata": {}, "outputs": [], "source": [ "m = SimpleModel()\n", "m.fit(X,y)\n", "m.predict(X)" ] }, { "cell_type": "code", "execution_count": 296, "metadata": {}, "outputs": [], "source": [ "mlutils.plot_2Ddata_with_boundary(m.predict, X, y); plt.grid();\n", "np.mean(y==m.predict(X))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check your model with different parametrizations of the `moons` dataset (more or fewer data points, more or less noise)." ] }, { "cell_type": "code", "execution_count": 293, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_moons\n", "\n", "mX, my = make_moons(100, noise=.1)\n", "m = SimpleModel()\n", "m.fit(mX,my)\n", "\n", "mlutils.plot_2Ddata_with_boundary(m.predict, mX, my); plt.grid();\n", "np.mean(my==m.predict(mX))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Submit your answer**" ] }, { "cell_type": "code", "execution_count": 313, "metadata": {}, "outputs": [], "source": [ "student.submit_task(globals(), task_id=\"task_03\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }